Primal-Dual π Learning: Sample Complexity and Sublinear Run Time for Ergodic Markov Decision Problems
Author
Abstract
Consider the problem of approximating the optimal policy of a Markov decision process (MDP) by sampling state transitions. In contrast to existing reinforcement learning methods that are based on successive approximations to the nonlinear Bellman equation, we propose a Primal-Dual π Learning method in light of the linear duality between the value and policy. The π learning method is model-free and makes primal-dual updates to the policy and value vectors as new data are revealed. For an infinite-horizon undiscounted Markov decision process with finite state space S and finite action space A, the π learning method finds an ε-optimal policy using the following number of sample transitions
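As a rough illustration of the primal-dual updates described in the abstract, the following is a minimal sketch of a stochastic primal-dual iteration on the linear-programming form of an ergodic MDP. It is not the paper's algorithm, step-size schedule, or analysis: the simulator sample_transition, the constant step sizes alpha and beta, and the use of simplex normalization in place of the LP's normalization constraint are all illustrative assumptions.

```python
import numpy as np

def primal_dual_pi_learning_sketch(sample_transition, n_states, n_actions,
                                   n_iters=200_000, alpha=0.01, beta=0.01):
    """Minimal sketch (not the paper's exact algorithm) of primal-dual
    updates on the LP formulation of an ergodic MDP.

    `sample_transition(s, a)` is an assumed simulator returning
    (next_state, reward) for the queried state-action pair.
    """
    v = np.zeros(n_states)                       # dual variables: value estimates
    mu = np.full((n_states, n_actions),          # primal variables: state-action
                 1.0 / (n_states * n_actions))   # occupancy measure on the simplex

    for _ in range(n_iters):
        # Sample a state-action pair from the current occupancy estimate,
        # then observe one transition from the model-free simulator.
        idx = np.random.choice(n_states * n_actions, p=mu.ravel())
        s, a = divmod(idx, n_actions)
        s_next, r = sample_transition(s, a)

        # Stochastic estimate of the Lagrangian gradient w.r.t. mu(s, a):
        # r(s, a) + v(s') - v(s).
        adv = r + v[s_next] - v[s]

        # Dual step: stochastic gradient descent on the flow constraints.
        v[s] += alpha
        v[s_next] -= alpha

        # Primal step: exponentiated-gradient (mirror) ascent on the simplex.
        mu[s, a] *= np.exp(beta * adv)
        mu /= mu.sum()

    # Read off a randomized policy by conditioning the occupancy on the state.
    policy = mu / mu.sum(axis=1, keepdims=True)
    return policy, v
```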
Similar References
Randomized Linear Programming Solves the Discounted Markov Decision Problem In Nearly-Linear (Sometimes Sublinear) Run Time
We propose a novel randomized linear programming algorithm for approximating the optimal policy of the discounted Markov decision problem. By leveraging the value-policy duality and binary-tree data structures, the algorithm adaptively samples state-action-state transitions and makes exponentiated primal-dual updates. We show that it finds an ε-optimal policy using nearly-linear run time in the ...
Randomized Linear Programming Solves the Discounted Markov Decision Problem In Nearly-Linear Running Time
We propose a novel randomized linear programming algorithm for approximating the optimal policy of the discounted Markov decision problem. By leveraging the value-policy duality and binary-tree data structures, the algorithm adaptively samples state-action-state transitions and makes exponentiated primal-dual updates. We show that it finds an ε-optimal policy using nearly-linear run time in the...
Stochastic Primal-Dual Methods and Sample Complexity of Reinforcement Learning
We study the online estimation of the optimal policy of a Markov decision process (MDP). We propose a class of Stochastic Primal-Dual (SPD) methods which exploit the inherent minimax duality of Bellman equations. The SPD methods update a few coordinates of the value and policy estimates as a new state transition is observed. These methods use small storage and have low computational complexity p...
Deep Primal-Dual Reinforcement Learning: Accelerating Actor-Critic using Bellman Duality
We develop a parameterized Primal-Dual π Learning method based on deep neural networks for Markov decision processes with large state spaces and off-policy reinforcement learning. In contrast to the popular Q-learning and actor-critic methods that are based on successive approximations to the nonlinear Bellman equation, our method makes primal-dual updates to the policy and value functions utilizi...
Accelerated Primal-Dual Policy Optimization for Safe Reinforcement Learning
Constrained Markov Decision Process (CMDP) is a natural framework for reinforcement learning tasks with safety constraints, where agents learn a policy that maximizes the long-term reward while satisfying the constraints on the long-term cost. A canonical approach for solving CMDPs is the primal-dual method which updates parameters in primal and dual spaces in turn. Existing methods for CMDPs o...
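Two of the related abstracts above mention binary-tree data structures for adaptively sampling state-action-state transitions. Below is a minimal sum-tree sketch, under the assumption that the binary tree referred to is a structure of this kind; the class name and layout are illustrative, not the authors' implementation. It supports O(log n) weight updates and O(log n) sampling proportional to the weights, which is what keeps adaptive, exponentiated-weight sampling cheap.

```python
import random

class SumTree:
    """Illustrative sum tree: keeps nonnegative weights over n items and
    supports O(log n) weight updates and O(log n) sampling of an index with
    probability proportional to its weight."""

    def __init__(self, n):
        self.n = n
        self.tree = [0.0] * (2 * n)  # leaves live at positions n .. 2n-1

    def update(self, i, weight):
        """Set the weight of item i and refresh the sums on its root path."""
        pos = self.n + i
        self.tree[pos] = weight
        pos //= 2
        while pos >= 1:
            self.tree[pos] = self.tree[2 * pos] + self.tree[2 * pos + 1]
            pos //= 2

    def sample(self):
        """Draw an index i with probability proportional to its weight."""
        u = random.random() * self.tree[1]   # tree[1] holds the total weight
        pos = 1
        while pos < self.n:                  # descend toward a leaf
            left = 2 * pos
            if u <= self.tree[left]:
                pos = left
            else:
                u -= self.tree[left]
                pos = left + 1
        return pos - self.n

# Example: indices 0..3 are drawn with probabilities 1/15, 2/15, 4/15, 8/15.
t = SumTree(4)
for i, w in enumerate([1.0, 2.0, 4.0, 8.0]):
    t.update(i, w)
print(t.sample())
```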
Journal: CoRR
Volume: abs/1710.06100
Pages: -
Publication date: 2017